Content oriented retrieval on document centric XML

Author

  • Philipp Dopichaj
Abstract

XML is the perfect format for storing (mostly) textual documents in a digital library; its flexibility enables users to store both highly structured data (like database records) and free text in the same document. The data-centric parts can be searched using query languages like XPath and XQuery, where exact conditions on the structure can be imposed. For digital libraries, however, it is important to be able to search the free-text parts effectively. Standard information retrieval systems return complete documents as retrieval results, which is useful for short documents such as web pages. If the document collection consists of books, however, this result granularity is too coarse. Users should be able to find the information that helps them solve their problem without having to wade through much irrelevant material. Content-oriented XML retrieval can help here: documents are not considered atomic entities, as they are in traditional text-based information retrieval. A retrieval result can contain not only complete documents but also parts of documents, such as chapters or paragraphs.

This thesis investigates several major aspects of content-oriented retrieval on XML documents:

  • The adaptation of standard information retrieval techniques to XML. The adaptation is mostly straightforward, but several peculiarities of XML have to be taken into account.
  • A space-efficient implementation of this base retrieval engine using customized index structures. Although the base retrieval engine uses the same concept of similarity as standard information retrieval systems, it is possible to take the XML structure into account when indexing to save space and time.
  • A novel method for improving retrieval quality by making use of the document structure. In particular, section titles are exploited for finding highly relevant sections in the documents.
  • A detailed evaluation of the retrieval quality for the base retrieval system and the proposed method. This evaluation is based on my own implementation of the retrieval system and a standard benchmark.
  • Preliminary ideas for searching data-centric parts of the documents.

The only major part missing for a usable retrieval system is the user interface.
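The idea that a retrieval result may be a chapter or paragraph rather than a whole document can be illustrated with a minimal sketch. It is not the retrieval engine described in the thesis: the sample document `DOC`, the helper names `tokens` and `rank_elements`, and the scoring formula (query-term hits divided by the square root of the element length) are all assumptions chosen for illustration.

```python
import math
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from the thesis.
DOC = """<book>
  <chapter><title>Indexing</title>
    <p>Inverted index structures store term postings.</p>
    <p>Compression saves space in the index.</p>
  </chapter>
  <chapter><title>Cooking</title>
    <p>Boil the pasta for ten minutes.</p>
  </chapter>
</book>"""


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_elements(xml_text, query):
    """Score every element (not just the root) against the query.

    Assumed toy score: number of query-term occurrences in the
    element's full text, divided by sqrt(element length in words),
    so that long elements are penalized for diluting the matches.
    """
    root = ET.fromstring(xml_text)
    query_terms = set(tokens(query))
    results = []
    for elem in root.iter():
        # itertext() yields the text of the element and all descendants.
        words = tokens(" ".join(elem.itertext()))
        if not words:
            continue
        hits = sum(1 for w in words if w in query_terms)
        if hits:
            score = hits / math.sqrt(len(words))
            results.append((score, elem.tag, " ".join(words[:6])))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    for score, tag, preview in rank_elements(DOC, "index compression"):
        print(f"{score:.3f}  <{tag}>  {preview} ...")
```

With this sample, the first chapter and its second paragraph score higher than the `book` element as a whole, which is the finer result granularity the abstract argues for. A real system would use a proper similarity measure and index structures instead of rescanning the text.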


Similar resources

Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML

As we move to a big-data world where unstructured data is growing at high velocity, a data store is needed to hold this large volume of data. Relational databases, the traditional choice, are not suited to handling this amount of data, so a move to non-relational data stores is needed. In the current study, we have proposed an extension of the Mongo...



XML Information Retrieval - Achievements and Challenges

Data-centric view: XML as exchange format for structured data. Document-centric view: XML as format for representing the logical structure of documents. This talk: focus on document-...


Context Driven XML Retrieval

This paper presents a data-centric approach to XML information retrieval which benefits from XML document structure and adapts traditional text-centric information retrieval techniques to deal with text content inside XML. We implement our ideas in a configurable, general purpose XML retrieval library which can be tuned to operate on multilingual XML resources with different structure and can b...


Processing Content-And-Structure Queries for XML Retrieval

Document-centric XML collections contain text-rich documents, marked up with XML tags. The tags add lightweight semantics to the text. Querying such collections calls for a hybrid query language: the text-rich nature of the documents suggests a content-oriented (IR) approach, while the mark-up allows users to add structural constraints to their IR queries. We propose an approach to such hybrid c...



Publication date: 2008